New Package In Development: Snakes

tl;dr: snakes is a package containing data on individual snake species, from the World Health Organization Snake Antivenoms Database. You can get an early version with devtools::install_github("callumgwtaylor/snakes").

Recently, for a masters assignment, I was looking at the global burden of snakebite. Keen to get some pretty diagrams in to distract the reviewers from my terrible writing, I went online to find a resource I could analyse in R. This had worked well enough for previous essays, especially with the existence of The Humanitarian Data Exchange, plus a mini package I'd written for my own purposes to import their data easily.

However, after a brief search, the only online resource I could see was the WHO Snake and Antivenoms Database. Most organisations are keen to get you to have a delve into what they have; recently I've used data from The Global Terrorism Database and the International Disasters Database. Both required a bit of registration, but ultimately sent you a csv download.

The WHO database, however, looked like this: And once you selected a snake, you got this:

Now, I really like this setup; there's a clear layout of information for each snake on its individual page. You get a map and an image, and most importantly, which antivenoms exist. However, if you want to make comparisons between lots of snakes, that becomes a bit more difficult. If you want to explore which snakes don't have antivenoms, or the global distribution of antivenom producers, or anything else, then there's no straightforward way to do so without loading a lot of individual pages to take information down.

At this point, I realised that working out how to extract any info would definitely allow me to procrastinate beyond the essay deadline, so I gave up. But I did want to learn how to do some basic scraping in case I bumped into a similar issue in the future. Using this article on rvest, plus this talk on purrr and rvest, I've downloaded the data for each species of medically important snake from the WHO database.
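The scraping pattern behind that is worth sketching: build a list of species pages, then use purrr to map the same rvest extraction over every one. Here's a minimal, self-contained sketch using inline HTML stand-ins for WHO pages; the .species selector is an illustrative assumption, not the markup the WHO site actually uses:

```r
library(rvest)  # html_element(), html_text2(), minimal_html()
library(purrr)  # map_chr()

# Stand-ins for individual species pages; in practice each would be read_html(url)
pages <- list(
  minimal_html("<h1 class='species'>Bitis arietans</h1>"),
  minimal_html("<h1 class='species'>Naja nivea</h1>")
)

# Map one extraction rule over every page, collecting a character vector
species <- map_chr(pages, ~ html_text2(html_element(.x, ".species")))
species
#> [1] "Bitis arietans" "Naja nivea"
```

The same shape scales to the real thing: swap the inline snippets for read_html() calls over a vector of species urls, and map over several selectors to build a data frame.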

Now, if you load the snakes package from github, you can save yourself the hassle of doing what I did:

devtools::install_github("callumgwtaylor/snakes")
library(snakes)
snake_species <- snakes::snake_species_data

snake_species_data will give you:

  • The Identifier the WHO uses for the snake species
  • Snake Common Name
  • Snake Species Name
  • Snake Family
  • A link to the WHO map of snake distribution
  • A link to the legend for the WHO map
  • A link to a WHO picture of the snake
  • The first other common name for the snake
  • Any other common names for the snake
  • Any previous names for the snake
  • A nested data_frame, containing the regions and subregions the snake is found in
  • A nested data_frame, containing the snake antivenom product names, manufacturers, and countries of origin
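Since the regions and antivenoms arrive as nested data frame columns, tidyr::unnest() is the quickest way to flatten them out for comparisons across species. A sketch on a toy tibble standing in for snake_species_data (the column names here are illustrative assumptions, not necessarily what the package uses):

```r
library(dplyr)
library(tidyr)

# Toy stand-in: one nested antivenom table per species
toy_snakes <- tibble(
  species = c("Bitis arietans", "Naja nivea"),
  antivenoms = list(
    tibble(product = c("Antivenom A", "Antivenom B"), country = c("UK", "South Africa")),
    tibble(product = "Antivenom B", country = "South Africa")
  )
)

# One row per species-antivenom pair, ready for grouping and counting
flat <- toy_snakes %>% unnest(antivenoms)
nrow(flat)
#> [1] 3
```

From there, questions like "which countries produce the most antivenoms" become a count() away.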

This is the first version of the package containing what I’ve put together today. I hope to update it with some more useful geographical information. The database does have individual country pages, so a data_frame including snakes in each country is the next target.

Lastly, this is clearly not my data and I make no claims of ownership whatsoever. The WHO are the copyright holders for any data, and whilst I think this package comes under acceptable use for research, please let me know if you're someone from the WHO who disagrees.

Actually lastly: this data is extracted from the WHO database, but I am not making any claims about its accuracy, so this information should not be used to make any clinical decisions about antivenom use, or any similar decisions.

The State of Cholera in Yemen in December

Cholera in Yemen : UPDATED 16-DECEMBER-2017

At the time of writing, the WHO has released information about 959,810 cases of cholera in Yemen, with 2,219 deaths. This post uses data released on 2017-11-26. This information has been collated by the Humanitarian Data Exchange (HDX), and put online. All the information below has been taken from HDX and read into R using hdxr. The code to run it all is in this rmarkdown document.

Cholera Cases

Total number of cases of cholera in each governorate in Yemen

Daily number of new cases of cholera in each governorate in Yemen

Cholera Deaths

Total number of deaths from cholera in each governorate in Yemen

Country Level

Cholera Deaths Table

Administrative District Deaths Cases
Hajjah 417 106933
Ibb 284 59932
Al Hudaydah 271 139145
Taizz 184 58223
Amran 174 94581
Dhamar 160 90560
Al Mahwit 148 56447
Sana’a 122 68453
Raymah 117 14497
Al Dhale’e 81 47004
Amanat Al Asimah 70 91799
Aden 62 20286
Abyan 35 28103
Al Bayda 33 26793
Al Jawf 22 14689
Lahj 21 22596
Marib 7 6897
Sa’ada 5 9722
Shabwah 3 1396
Hadramaut 2 587
Al Maharah 1 1167
Socotra NA NA

Cholera Cases Table

Administrative District Deaths Cases
Al Hudaydah 271 139145
Hajjah 417 106933
Amran 174 94581
Amanat Al Asimah 70 91799
Dhamar 160 90560
Sana’a 122 68453
Ibb 284 59932
Taizz 184 58223
Al Mahwit 148 56447
Al Dhale’e 81 47004
Abyan 35 28103
Al Bayda 33 26793
Lahj 21 22596
Aden 62 20286
Al Jawf 22 14689
Raymah 117 14497
Sa’ada 5 9722
Marib 7 6897
Shabwah 3 1396
Al Maharah 1 1167
Hadramaut 2 587
Socotra NA NA

The State of Cholera in Yemen in November

Cholera in Yemen : UPDATED 16-NOVEMBER-2017

At the time of writing, the WHO has released information about 913,741 cases of cholera in Yemen, with 2,196 deaths. This post uses data released on 2017-11-08. This information has been collated by the Humanitarian Data Exchange (HDX), and put online. All the information below has been taken from HDX and read into R using hdxr. The code to run it all is in this rmarkdown document.

Cholera Cases

Total number of cases of cholera in each governorate in Yemen

Daily number of new cases of cholera in each governorate in Yemen

Cholera Deaths

Total number of deaths from cholera in each governorate in Yemen

Country Level

Cholera Deaths Table

Administrative District Deaths Cases
Hajjah 414 100850
Ibb 282 57136
Al Hudaydah 268 131827
Taizz 184 54422
Amran 170 89729
Dhamar 157 84741
Al Mahwit 145 52920
Sana’a 122 66086
Raymah 116 13903
Al Dhale’e 81 46721
Amanat Al Asimah 68 87578
Aden 62 19816
Abyan 35 27957
Al Bayda 31 25905
Al Jawf 22 13722
Lahj 21 22524
Marib 7 6102
Sa’ada 5 8662
Shabwah 3 1390
Hadramaut 2 586
Al Maharah 1 1164
Socotra NA NA

Cholera Cases Table

Administrative District Deaths Cases
Al Hudaydah 268 131827
Hajjah 414 100850
Amran 170 89729
Amanat Al Asimah 68 87578
Dhamar 157 84741
Sana’a 122 66086
Ibb 282 57136
Taizz 184 54422
Al Mahwit 145 52920
Al Dhale’e 81 46721
Abyan 35 27957
Al Bayda 31 25905
Lahj 21 22524
Aden 62 19816
Raymah 116 13903
Al Jawf 22 13722
Sa’ada 5 8662
Marib 7 6102
Shabwah 3 1390
Al Maharah 1 1164
Hadramaut 2 586
Socotra NA NA

In Yemen, cases of cholera and deaths continue to rise

Cholera in Yemen : UPDATED 22-JULY-2017

At the time of writing, the WHO has released information about 368,207 cases of cholera in Yemen, with 1,828 deaths. This post uses data released on 2017-07-19. This information has been collated by the Humanitarian Data Exchange (HDX), and put online. All the information below has been taken from HDX and read into R using hdxr. The code to run it all is in this rmarkdown document.

Over the last fortnight the rate of new cases of cholera has slowed slightly, but we're still seeing more than 5,000 new cases daily. Whilst numbers of new diagnoses of cholera are dropping in regions like Sana'a, more badly affected governorates like Al Hudaydah show no signs of slowing down. The mortality rate is staying relatively static, with one in every 200 cases of cholera resulting in death.

According to Oxfam, the total number of cases of cholera in Yemen could double to over 600,000 with the rainy season. Keeping our current mortality rate, that's 3,000 deaths from a preventable illness.

Cholera Cases

Total number of cases of cholera in each governorate in Yemen

Daily number of new cases of cholera in each governorate in Yemen

Cholera Deaths

Total number of deaths from cholera in each governorate in Yemen

Country Level

Cholera Deaths Table

Administrative District Deaths Cases
Hajjah 353 38936
Ibb 233 28250
Al Hudaydah 212 45580
Taizz 161 25927
Amran 150 37814
Dhamar 124 26343

Cholera Cases Table

Administrative District Deaths Cases
Amanat Al Asimah 60 47647
Al Hudaydah 212 45580
Hajjah 353 38936
Amran 150 37814
Ibb 233 28250
Dhamar 124 26343

Cholera in Yemen is getting worse

Publicly available data about the cholera crisis in Yemen is provided through bulletins from the World Health Organisation. Released online, each one gives the most recent running total of cholera cases and deaths in each part of the country. You can get a copy yourself here.

The code to recreate this map and the plot is available here.

The map below shows the most recent numbers. Currently it's the more densely populated areas of the west coast that are most badly affected, particularly Sana'a and Al Hudaydah.

However, what this map doesn't show easily is that things seem to be getting worse.

If we take each bulletin's total number of cases and subtract the previous total (adjusting for the fact we sometimes have to wait a few days for an update), it seems that the number of new cases of cholera is increasing in several areas. Al Hudaydah had around 250 cases per day at the start of the epidemic, but by the most recent bulletin we were seeing 1,000 new cases a day.

In fact, in the most recent bulletin, Al Hudaydah, Hajjah, and Al Dhale'e had all averaged over 1,000 new cases per day since the bulletin before.
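That adjustment is just the bulletin-to-bulletin difference in the running total, divided by the number of days between bulletins. A sketch with made-up figures (the column names are illustrative, not the ones in the real HDX extract):

```r
library(dplyr)

# Running totals from three hypothetical bulletins, a few days apart
bulletins <- tibble(
  date  = as.Date(c("2017-06-01", "2017-06-04", "2017-06-08")),
  cases = c(10000, 13000, 17000)
)

daily_new <- bulletins %>%
  arrange(date) %>%
  mutate(
    days_between = as.numeric(date - lag(date)),
    new_per_day  = (cases - lag(cases)) / days_between
  )

daily_new$new_per_day
#> [1]   NA 1000 1000
```

Grouping by governorate before the mutate() gives the same per-day rate for each region separately.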

Cholera in Yemen - mapping deaths from the current epidemic using HDX in R

Cholera in Yemen : UPDATED 12-JULY-2017

At the time of writing this, the WHO has released information about 320,199 cases of cholera in Yemen, with 1,742 deaths.

This information has been collated by the Humanitarian Data Exchange (HDX), and put online. I made a new function for hdxr the other day to make it easier to use maps from HDX, and wanted to learn more about what’s happening in Yemen.

All the information below has been taken from HDX and read into R using hdxr. The code to run it all is in this rmarkdown document.

Cholera Deaths

Cholera Cases

Cholera Deaths Table

Administrative District Deaths Cases
Hajjah 338 35310
Ibb 227 25433
Al Hudaydah 199 38942
Taizz 150 22903
Amran 149 32625
Dhamar 114 20848

Cholera Cases Table

Administrative District Deaths Cases
Amanat Al Asimah 56 42765
Al Hudaydah 199 38942
Hajjah 338 35310
Amran 149 32625
Ibb 227 25433
Sana’a 111 24360

Downloading data from HDX easily in a tidy format - hdxr and hdx_resource_csv

An easier pipeline for Humanitarian Data Exchange, with a tidy format

I’ve been trying to make it easier to extract data from HDX in R, using tidyverse and ropensci packages. I’ve started compiling a mini-package called hdxr that wraps these pipelines up neatly.

I've added a new function today, hdx_resource_csv, which means that to download datasets from HDX, all you need to do is the following:

library(hdxr)
hdx_connect()
datasets <- hdx_package_search(term = "data title") %>%
  hdx_resource_list() %>%
  hdx_resource_csv()

I’ve described most of these functions in previous posts, but basically:

hdx_connect uses ckanr to connect to the HDX CKAN server.

hdx_package_search will search hdx for the packages you’re looking for and return a dataframe. (You can use hdx_list to find titles of datasets)

hdx_resource_list will take that dataframe and use tidyverse features to extract information about the datasets themselves.

hdx_resource_csv

The new function hdx_resource_csv will take the results of hdx_resource_list, and return a new dataframe. This will have three columns:

  • identifier: a title merged from the package and dataset titles
  • location: the url the csv was downloaded from
  • csv: a nested dataframe downloaded from the location url

hdx_resource_csv provides a nested dataframe column to allow a sensible output when you download multiple csvs all with different columns in one go.

When you want to extract a particular csv to work with, you can use dplyr::filter() and tidyr::unnest() to get at it. You could just unnest without filtering, but when every csv has different column titles, the output is a bit messy.

Example:

library(tidyverse)
library(hdxr)
hdx_connect()
datasets <- hdx_package_search(term = "141121-sierra-leone-health-facilities") %>%
  hdx_resource_list() %>%
  hdx_resource_csv()

The above would give us a nested dataframe, one row per csv. You can then select what you want with filter and unnest it:

sierra_leone_healthsites_sbtf_sle_health <- datasets %>%
  filter(dataset_identifier == "sierra-leone-healthsites_sbtf-sle-health") %>%
  unnest()

Any use? Wrong way of doing things?

I think this pipeline makes sense for me to extract data from HDX, and the majority of files on there are CSVs. If you've thoughts on how I should do it differently, please let me know either on github directly or through twitter.

If you want to try it yourself, download and installation instructions are on the github readme.

Using sf, gganimate, and the Humanitarian Data Exchange to map ACLED data for Africa

Mapping fatalities from violence in Africa in 2017

In my previous post I showed how I was trying to get data directly from the Humanitarian Data Exchange (HDX), in an R pipeline.

Here's a quick example of what it allows you to do. The visualisations in this aren't pretty; it's mainly meant as an example of what you can easily do. I'm really enjoying using sf to plot maps in ggplot2, thanks to Matt Strimas-Mackey showing how to get that working.

The code to run this example is on my github in an R markdown file.

These graphs use two datasets. The ACLED files were downloaded directly from HDX; I was then able to filter out just events with fatalities, and just for 2017. The shapefiles are from the rnaturalearth package.

DISCLAIMER: I do not know how complete this ACLED dataset is, and do not want to pretend it paints a full and accurate picture about a subject I have no expertise in!

Static map of fatalities from political violence

gganimate plot of fatalities from political violence, separated monthly

Getting data from Humanitarian Data Exchange in a reproducible R pipeline

Update: I've updated a couple of these functions, as they were messy and unreliable (they probably still are!). I've also put them together as a mini-package that you can install from github: hdxr. The code below is the OLD version; the up-to-date versions are on github. If you've ideas on how I should do it better, please let me know on github/twitter.

I've been trying to make it easier for me to get information from the Humanitarian Data Exchange. The folk who run the centre have been trying to make it easier to access too: they've released a python api, and they use CKAN, so you can download JSON information about each 'package' (basically the subject of the data) and each 'resource' (the data itself). The problem for me is that I don't know python, I'm not comfortable with CKAN, and I struggle with JSON too.

All I wanted to be able to do was search for a package, and download a specific resource, in a reproducible and roughly tidy-ish fashion.

I've made a couple of functions to make it easier to do so, using ckanr and jsonlite.

So here are the functions I’ve made, and an example of using them:

Functions for interacting with the HDX CKAN

library(ckanr)
library(tidyverse)
library(jsonlite)

hdx_connect()

This will use the ckanr package to connect to the HDX ckan server.

# This creates a function to connect to the hdx server
hdx_connect <- function() {
  ckanr_setup(url = "http://data.humdata.org/")
}

hdx_list()

This function takes one argument, limit. It will return a list of HDX packages, depending on the limit set. There are currently almost 5,000 packages.

# This function will create a data frame listing packages, up to the limit set
hdx_list <- function(limit) {
  package_list(as = 'table', limit = limit) %>%
    as_data_frame()
}

hdx_resource_list()

To see the exact resources available in an easier-to-read dataframe they need unnesting. When this function is used on the results of hdx_package_search(), it will extract a resources dataframe, then left join the results onto the original dataframe provided to it.

# This function will take the results of a package_search and extract the resources,
# then join those resources back onto the package_search results, giving a new data frame
hdx_resource_list <- function(package) {
  package$resources %>%
    as.data.frame() %>%
    left_join(package, ., by = c("id" = "package_id")) %>%
    select(-resources)
}

Getting data off of HDX

Once you’ve used all the functions, you should have a dataframe with titles for the packages, and urls for the resources. You can then use httr or readr to download the files and bring them into your work.

Example: Exploring the ACLED Conflict Data for Africa

In this example I am going to identify the package “ACLED Conflict Data for Africa (Realtime - 2017)” here. It has an excel file, and a zipped csv. I’ll then identify the resource for the unzipped excel file.

Once we have that, we can use httr to download the file, and readxl to load it in.

library(ckanr)
library(tidyverse)
library(jsonlite)
library(readxl)
library(httr)
# First we connect to HDX

hdx_connect()

# We can list all packages available for us

hdx_list(5000)
## # A tibble: 4,905 x 1
##                                                                          value
##                                                                          <chr>
##  1                                       141121-sierra-leone-health-facilities
##  2                                      160516-ecuador-earthquake-4w-1st-round
##  3                                                      160523-ocha-4w-round-2
##  4                                                     160625-hrrp-4w-national
##  5 1999-2013-tally-of-internaly-displaced-persons-resulting-from-natural-disas
##  6                                                                  2011-nepal
##  7                                       2012-census-tanzania-wards-shapefiles
##  8                                        2014-2015-food-security-ipc-analysis
##  9                         2014-nutrition-smart-survey-results-and-2015-trends
## 10                                 2015-humanitarian-needs-overview-indicators
## # ... with 4,895 more rows
# We then search for the package we want, and use dplyr to filter it to the exact package

hum_data_packages <- hdx_package_search("ACLED Conflict Data for Africa") %>%
  filter(title == "ACLED Conflict Data for Africa (Realtime - 2017)")

# We then expand our resources from the search result, and look for the unzipped excel file

hum_data_resources <- hdx_resource_list(hum_data_packages) %>%
  filter(format == "XLSX")
url <- hum_data_resources$hdx_rel_url
GET(url, write_disk("dataset.xlsx", overwrite=TRUE))
## Response [http://www.acleddata.com/wp-content/uploads/2017/06/ACLED-All-Africa-File_20170101-to-20170617.xlsx]
##   Date: 2017-07-03 12:21
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 1.86 MB
## <ON DISK>  C:\Users\callu\Dropbox\blog\content\post\dataset.xlsx
ACLED_CONFLICT_DATA <- read_excel("dataset.xlsx", col_names = TRUE)
rm(url)

Mini Package: hdxr

Mini package - Mini post

I realised that when I wrote this post, I should have done something more sensible to share the code than putting the text up in a blog. Writing a package seemed overkill though, and something you only do for 'real code'.

But curiosity got the better of me, and after reading the go-to post for package writing, I thought I might as well try.

So here we are: hdxr

It has the functions I mentioned in the above post:

  • hdx_connect()
  • hdx_list()
  • hdx_package_search()
  • hdx_resource_list()

How to use them is explained in the github repo and blog post above. There has been no testing whatsoever and I bet they won't work first try for you. If you want to let me know how to improve them, or how to use ckanr better, then I can be contacted on github or twitter.

Plotting deprivation in Scotland, using geofacet and sf in R

Using the geofacet package to plot deprivation in Scotland by Health Boards

A few months ago, when I was first starting to learn R, I tried looking at the data from the Scottish Index of Multiple Deprivation. The Scottish Government split Scotland up into 6976 equally populated "data zones" (not quite neighbourhoods, but pretty close), and ranked them from most deprived (1) to least deprived (6976).

Recently I’ve gone back to the same files, to see if what I’ve learnt has made it easier to look at deprivation in Scotland.

This weekend I found out about a new-ish package, geofacet, from Ryan Hafen. Luckily for me, Joseph Adams had already submitted a grid for Scottish Health Boards, making it easy to plot this all out.

I’ve included the code I used to make the plot.

library(tidyverse)
library(sf)
library(readxl)
library(geofacet)
map_scot <- st_read("../data/scot_gov_data/data_zone_shapefiles/.")
## Reading layer `SG_SIMD_2016' from data source `C:\Users\callu\Dropbox\blog\content\data\scot_gov_data\data_zone_shapefiles' using driver `ESRI Shapefile'
## Simple feature collection with 6976 features and 49 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 5513 ymin: 530252.8 xmax: 470323 ymax: 1220302
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs
data_postcode_simd <- read_excel("../data/scot_gov_data/00505244.xlsx", sheet = 2)
data_simd_ranks <- read_excel("../data/scot_gov_data/00512735.xlsx", sheet = 3)

data_simd_ranks$HBname[data_simd_ranks$HBname == "Western Isles"] <- "Western Isle"

Deprivation in Scotland by Health Boards

So looking at this, brighter colours equal lower average levels of deprivation. A distribution towards the left shows a health board with a greater proportion of deprived data zones.

Using the geofacet package, we swap out a normal facet_wrap() for facet_geo(), and tell it what layout we want with grid = "nhs_scot_grid". The only problem is that this grid has "Western Isles" saved as "Western Isle", so we have to rename our own data to match. Now our health boards are placed in roughly geographical order.

data_simd_ranks %>%
  select(DataZone = DZ, HBname) %>%
  left_join(map_scot, .) %>%
  group_by(HBname) %>%
  mutate(median_deprivation = median(Rank)) %>%
  ungroup() %>%
  ggplot() +
  geom_histogram(aes(x = Rank, y = ..ncount.., fill = median_deprivation), binwidth = 100) +
  facet_geo(~HBname, grid = "nhs_scot_grid") +
  labs(title = "Deprivation in Health Boards in Scotland",
       x = "Relative deprivation of data zones (left = more deprived neighbourhoods)",
       y = "Proportion of data zones",
       caption = "Darker colour shows increased average deprivation of healthboard") + 
  theme_bw() +
  theme(legend.position = "none") 

Glasgow versus Edinburgh

In the plot above, the west coast / east coast divide is looking pretty big in the central belt of Scotland. The plot below shows this even more. Glasgow's neighbourhoods are massively skewed towards some of the most deprived parts of Scotland, whereas in Edinburgh we see the opposite. Obviously Glasgow still has some very well-off parts, and Edinburgh has some deprived areas, but in this graph they seem to be polar opposites.

city_data <- data_simd_ranks %>%
  select(DataZone = DZ, LAname) %>%
  left_join(map_scot, .) %>%
  group_by(LAname) %>%
  filter(LAname == "Glasgow City" | LAname == "City of Edinburgh") %>%
  mutate(median_deprivation = median(Rank)) %>%
  ungroup()

city_data$LAname <- parse_factor(city_data$LAname, levels = c("Glasgow City", "City of Edinburgh"))

ggplot(city_data) +
  geom_histogram(aes(x = Rank, y = ..ncount.., fill = median_deprivation), binwidth = 100) +
  facet_wrap(~LAname) +
  labs(title = "Deprivation in Glasgow and Edinburgh",
       x = "Relative deprivation of data zones",
       y = "Proportion of data zones",
       caption = "Distribution to the left shows increased deprivation in city, darker colour shows increased average deprivation of city") + 
  theme_bw() +
  theme(legend.position = "none") 

About

This shows what I’m procrastinating with and pretending counts as work.

This is a personal site built using the blogdown package. The theme is from @spf13/hyde
